General suffix automaton construction algorithm and space bounds

Authors

  • Mehryar Mohri
  • Pedro J. Moreno
  • Eugene Weinstein
Abstract

Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U has at most 2Q − 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U. This bound significantly improves over 2‖U‖ − 1, the bound given by Blumer et al. (1987), where ‖U‖ is the sum of the lengths of all strings in U. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O(|S|). In fact, our algorithm applies to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
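To make the two size bounds in the abstract concrete, the following minimal sketch computes Q and ‖U‖ for a small set of strings and evaluates both 2Q − 2 and 2‖U‖ − 1. It is only an illustration of the arithmetic, not the paper's construction algorithm. The function names (`count_trie_nodes`, `state_bounds`) and the example set are hypothetical, and the sketch assumes that Q counts the root of the prefix tree.

```python
def count_trie_nodes(strings):
    """Return Q, the number of nodes of a prefix tree built from `strings`.

    Assumption: the root node is included in the count.
    """
    root = {}
    nodes = 1  # the root
    for s in strings:
        node = root
        for ch in s:
            if ch not in node:
                node[ch] = {}   # new trie node for an unseen prefix
                nodes += 1
            node = node[ch]
    return nodes


def state_bounds(strings):
    """Evaluate both upper bounds on the number of automaton states."""
    q = count_trie_nodes(strings)
    norm_u = sum(len(s) for s in strings)  # ||U||: total length of all strings
    return {
        "Q": q,
        "norm_U": norm_u,
        "prefix_tree_bound": 2 * q - 2,   # 2Q - 2, from the abstract
        "blumer_bound": 2 * norm_u - 1,   # 2||U|| - 1, Blumer et al. (1987)
    }


if __name__ == "__main__":
    # Hypothetical example: three strings sharing the prefix "abc".
    print(state_bounds(["abcd", "abce", "abcf"]))
```

The gap between the two bounds grows with the amount of prefix sharing in U: shared prefixes are counted once in Q but once per string in ‖U‖.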

Similar resources

Suffix Tree

SYNONYMS: Compact suffix trie. DEFINITION: The suffix tree S(y) of a non-empty string y of length n is a compact trie representing all the suffixes of the string. The suffix tree of y is defined by the following properties:

  • All branches of S(y) are labeled by all suffixes of y.
  • Edges of S(y) are labeled by strings.
  • Internal nodes of S(y) have at least two children.
  • Edges outgoing an intern...

Construction of Aho Corasick Automaton in Linear Time for Integer Alphabets

We present a new simple algorithm that constructs an Aho Corasick automaton for a set of patterns, P, of total length n, in O(n) time and space for integer alphabets. Processing a text of size m over an alphabet Σ with the automaton costs O(m log |Σ| + k), where there are k occurrences of patterns in the text. A new, efficient implementation of nodes in the Aho Corasick automaton is introduced, ...

Parallel Construction of Minimal Suffix and Factor Automata

This paper gives optimal parallel algorithms for the construction of the smallest deterministic finite automata recognizing all the suffixes and the factors of a string. The algorithms use recently discovered optimal parallel suffix tree construction algorithms together with data structures for the efficient manipulation of trees, exploiting the well known relation between suffix and factor aut...

Closing in on Time and Space Optimal Construction of Compressed Indexes

Fast and space-efficient construction of compressed indexes such as compressed suffix array (CSA) and compressed suffix tree (CST) has been a major open problem until recently, when Belazzougui [STOC 2014] described an algorithm able to build both of these data structures in O(n) (randomized; later improved by the same author to deterministic) time and O(n / log_σ n) words of space, where n is t...

Breaking a Time-and-Space Barrier in Constructing Full-Text Indices

Suffix trees and suffix arrays are the most prominent full-text indices, and their construction algorithms are well studied. In the literature, the fastest algorithm runs in O(n) time, while it requires O(n log n)-bit working space, where n denotes the length of the text. On the other hand, the most space-efficient algorithm requires O(n)-bit working space while it runs in O(n log n) time. It w...

Journal:
  • Theor. Comput. Sci.

Volume 410

Publication date: 2009